-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New job queue: worker registration and leader election #3307
base: main
Are you sure you want to change the base?
Conversation
Deploying matrix-authentication-service-docs with Cloudflare Pages
|
48e5507
to
e419853
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable, but it also seems quite intricate and would benefit from a careful review, including aspects like whether it's robust against clock drift, node failures, etc — that sort of thing.
I would also want to carefully review what happens whether there are any problems if the current leader loses connection but still believes it is the leader.
-- The leader is responsible for running maintenance tasks | ||
CREATE UNLOGGED TABLE queue_leader ( | ||
-- This makes the row unique | ||
active BOOLEAN NOT NULL DEFAULT TRUE UNIQUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it sounds silly, but I'd make this a PRIMARY KEY
— maybe this sounds dogmatic? But there a handful of tools are not happy with tables that don't have a primary key, e.g. logical replication in Postgres by default, I'd say it's worth always using it instead of UNIQUE etc.
// If no row was updated, the worker was shutdown so we return an error | ||
DatabaseError::ensure_affected_rows(&res, 1)?; | ||
|
||
Ok(worker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the docstring says this returns the modified worker, but I don't see us modifying it.
I would expect the worker to track its own validity timestamps, but I guess the critical thing here is just that we 'take away' the Worker if we can't renew it?
clock: &dyn Clock, | ||
threshold: Duration, | ||
) -> Result<(), Self::Error> { | ||
let now = clock.now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it reasonable to rely on the system clock (which could drift between servers)?
I suppose we could use the Postgres database's clock alternatively. But I don't know which one is best, mostly just interested in considering it carefully
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point, although I'd imagine multiple servers in the same datacenter usually have the same time source/NTP server?
The clock is abstracted through a trait though, so maybe at some point we can take into account time drift, and regularly sync the local system clock with the database or something, but I wouldn't worry too much about it for now
e419853
to
1370a04
Compare
2f19fff
to
80aa6fa
Compare
80aa6fa
to
f060abe
Compare
We would like to use the underlying connection from the PgListener, which was added in a patch, but not yet merged or released.
f060abe
to
76afd6a
Compare
This adds the base for the new job queue system, with a simple worker registration system, as well as a leader election system.
The worker registration is meant to be used to detect lost workers and reschedule dead tasks they locked.
The leader election system is meant to have one leader performing all the maintenance work, like rescheduling tasks.
Part of #2785